Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Most existing text-video retrieval methods focus on cross-modal matching
between the visual content of videos and textual query sentences. However, in
real-world scenarios, online videos are often accompanied by relevant text
information such as titles, tags, and even subtitles, which can be utilized to
match textual queries. This insight has motivated us to propose a novel
approach to text-video retrieval, where we directly generate associated
captions from videos using zero-shot video captioning with knowledge from
web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated
captions, a natural question arises: what benefits do they bring to text-video
retrieval? To answer this, we introduce Cap4Video, a new framework that
leverages captions in three ways: i) Input data: video-caption pairs can
augment the training data. ii) Intermediate feature interaction: we perform
cross-modal feature interaction between the video and caption to produce
enhanced video representations. iii) Output score: the Query-Caption matching
branch can complement the original Query-Video matching branch for text-video
retrieval. We conduct comprehensive ablation studies to demonstrate the
effectiveness of our approach. Without any post-processing, Cap4Video achieves
state-of-the-art performance on four standard text-video retrieval benchmarks:
MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is
available at https://github.com/whwu95/Cap4Video.
Comment: Accepted by CVPR 2023. Selected as a Highlight (Top 2.5% of all submissions).
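As a rough illustration of the output-score use of captions described above (a minimal sketch, not the released Cap4Video code; the encoders producing the embeddings and the fusion weight alpha are assumed placeholders), the Query-Caption matching branch can be fused with the Query-Video matching branch as a weighted sum of cosine-similarity matrices:

```python
# Sketch only: fusing Query-Video and Query-Caption matching scores.
# Embeddings and the weight `alpha` are hypothetical placeholders.
import torch
import torch.nn.functional as F

def retrieval_scores(query_emb, video_emb, caption_emb, alpha=0.5):
    """Combine query-video and query-caption cosine similarities.

    query_emb:   (Q, D) text-query embeddings
    video_emb:   (V, D) video embeddings (possibly caption-enhanced)
    caption_emb: (V, D) embeddings of captions generated for each video
    """
    q = F.normalize(query_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(caption_emb, dim=-1)
    sim_qv = q @ v.t()  # Query-Video matching branch
    sim_qc = q @ c.t()  # Query-Caption matching branch
    return alpha * sim_qv + (1 - alpha) * sim_qc

# Toy usage: rank 4 videos for 2 queries with 512-d embeddings.
scores = retrieval_scores(torch.randn(2, 512), torch.randn(4, 512), torch.randn(4, 512))
ranking = scores.argsort(dim=-1, descending=True)
```

In a setup like this, the fusion weight would typically be tuned on a validation split rather than fixed at 0.5.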
Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Vision-language models (VLMs) pre-trained on large-scale image-text pairs
have demonstrated impressive transferability on various visual tasks.
Transferring knowledge from such powerful VLMs is a promising direction for
building effective video recognition models. However, current exploration in
this field is still limited. We believe that the greatest value of pre-trained
VLMs lies in building a bridge between visual and textual domains. In this
paper, we propose a novel framework called BIKE, which utilizes the cross-modal
bridge to explore bidirectional knowledge: i) We introduce the Video Attribute
Association mechanism, which leverages the Video-to-Text knowledge to generate
textual auxiliary attributes for complementing video recognition. ii) We also
present a Temporal Concept Spotting mechanism that uses the Text-to-Video
expertise to capture temporal saliency in a parameter-free manner, leading to
enhanced video representation. Extensive studies on six popular video datasets,
including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show
that our method achieves state-of-the-art performance in various recognition
scenarios, such as general, zero-shot, and few-shot video recognition. Our best
model achieves a state-of-the-art accuracy of 88.6% on the challenging
Kinetics-400 using the released CLIP model. The code is available at
https://github.com/whwu95/BIKE.
Comment: Accepted by CVPR 2023.
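The parameter-free temporal saliency idea mentioned above can be sketched as weighting frame features by their similarity to a textual concept embedding and pooling them with a softmax over time (a sketch under assumptions; the frame/text encoders and the temperature are not taken from the paper):

```python
# Sketch of text-guided, parameter-free temporal pooling (not the released BIKE code).
import torch
import torch.nn.functional as F

def temporal_saliency_pooling(frame_feats, text_feat, temperature=0.07):
    """Aggregate frame features by their similarity to a text concept.

    frame_feats: (T, D) per-frame visual embeddings
    text_feat:   (D,)   embedding of the class name / textual concept
    Returns a single (D,) video representation.
    """
    f = F.normalize(frame_feats, dim=-1)
    t = F.normalize(text_feat, dim=-1)
    sim = f @ t                                        # (T,) frame-text similarity
    weights = torch.softmax(sim / temperature, dim=0)  # temporal saliency, no learned parameters
    return (weights.unsqueeze(-1) * frame_feats).sum(dim=0)

# Toy usage: 8 frames with 512-d CLIP-like embeddings.
video_repr = temporal_saliency_pooling(torch.randn(8, 512), torch.randn(512))
```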
3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal
Estimating 3D interacting hand pose from a single RGB image is essential for
understanding human actions. Unlike most previous works that directly predict
the 3D poses of two interacting hands simultaneously, we propose to decompose
the challenging interacting hand pose estimation task and estimate the pose of
each hand separately. In this way, it is straightforward to take advantage of
the latest research progress in single-hand pose estimation.
However, hand pose estimation in interacting scenarios is very challenging, due
to (1) severe hand-hand occlusion and (2) ambiguity caused by the homogeneous
appearance of hands. To tackle these two challenges, we propose a novel Hand
De-occlusion and Removal (HDR) framework to perform hand de-occlusion and
distractor removal. We also propose the first large-scale synthetic amodal hand
dataset, termed Amodal InterHand Dataset (AIH), to facilitate model training
and promote the development of the related research. Experiments show that the
proposed method significantly outperforms previous state-of-the-art interacting
hand pose estimation approaches. Code and data are available at
https://github.com/MengHao666/HDR.
Comment: ECCV 2022.
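A hypothetical outline of the decomposition described above (placeholder module names, not the released HDR code): restore the occluded parts of each hand, erase the other hand as a distractor, then run any off-the-shelf single-hand estimator on each cleaned crop:

```python
# Structural sketch of a de-occlusion -> distractor-removal -> single-hand pipeline.
# All modules are placeholders passed in by the caller.
import torch
import torch.nn as nn

class HDRStylePipeline(nn.Module):
    def __init__(self, deoccluder: nn.Module, remover: nn.Module, single_hand_net: nn.Module):
        super().__init__()
        self.deoccluder = deoccluder            # inpaints occluded regions of the target hand
        self.remover = remover                  # erases the distracting other hand
        self.single_hand_net = single_hand_net  # any single-hand 3D pose estimator

    def forward(self, left_crop: torch.Tensor, right_crop: torch.Tensor):
        poses = []
        for crop in (left_crop, right_crop):
            restored = self.deoccluder(crop)    # hand de-occlusion
            cleaned = self.remover(restored)    # distractor removal
            poses.append(self.single_hand_net(cleaned))
        return poses  # [left-hand pose, right-hand pose]

# Toy instantiation with identity placeholders, just to show the call pattern.
pipe = HDRStylePipeline(nn.Identity(), nn.Identity(), nn.Identity())
out = pipe(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```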
Terahertz Sensor via Ultralow-Loss Dispersion-Flattened Polymer Optical Fiber: Design and Analysis
A novel cyclic olefin copolymer (COC)-based polymer optical fiber (POF) with a rectangular porous core is designed for terahertz (THz) sensing and analyzed with the finite element method. The numerical simulations show an ultrahigh relative sensitivity of 89.73% for the x-polarization mode at a frequency of 1.2 THz under optimum design conditions. Under the same conditions, the fiber also exhibits an ultralow confinement loss of 2.18 × 10⁻¹² cm⁻¹, a high birefringence of 1.91 × 10⁻³, a numerical aperture of 0.33, and an effective mode area of 1.65 × 10⁵ μm². Moreover, the dispersion variation remains within 0.7 ± 0.41 ps/THz/cm over the 1.0–1.4 THz frequency range. Compared with traditional sensors, the proposed sensor is expected to be valuable for THz sensing and communication applications.
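For context, the figures of merit quoted above are conventionally computed from FEM outputs roughly as follows (assumed textbook-style formulas, not necessarily the paper's exact expressions); with the reported effective mode area of 1.65 × 10⁵ μm² at 1.2 THz, the numerical-aperture formula below does give about 0.33:

```python
# Common figures of merit for porous-core THz fibers, from assumed FEM outputs.
import numpy as np

C = 3e8  # speed of light, m/s

def birefringence(n_eff_x, n_eff_y):
    # Difference between the effective indices of the two polarization modes.
    return abs(n_eff_x - n_eff_y)

def confinement_loss_per_cm(freq_hz, n_eff_imag):
    # Power attenuation from the imaginary part of the effective index (1/m -> 1/cm).
    return (4 * np.pi * freq_hz / C) * n_eff_imag / 100.0

def relative_sensitivity(n_core, n_eff_real, power_fraction_in_core):
    # Fraction of modal power overlapping the analyte, scaled by the index ratio.
    return (n_core / n_eff_real) * power_fraction_in_core

def numerical_aperture(freq_hz, a_eff_m2):
    # NA estimated from the effective mode area at the operating frequency.
    return 1.0 / np.sqrt(1.0 + np.pi * a_eff_m2 * freq_hz**2 / C**2)

# Toy usage with assumed solver outputs at 1.2 THz.
f = 1.2e12
print(birefringence(1.4521, 1.4502))
print(confinement_loss_per_cm(f, 1e-15))
print(relative_sensitivity(1.35, 1.30, 0.86))
print(numerical_aperture(f, 1.65e5 * 1e-12))  # 1.65e5 um^2 converted to m^2, ~0.33
```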